
 1-bit quantization


0e230b1a582d76526b7ad7fc62ae937d-AuthorFeedback.pdf

Neural Information Processing Systems

More extensive and thorough experiments are needed. Sub-1-bit quantization is only available through FleXOR. Or do some weights use >1 bit while others can use much less? The reviewer did not find results in the paper that used quantized inputs. "Input weight format" should read "Internal weight format."


A Additional Related Work. KD has been extensively applied to computer vision and NLP tasks [52]

Neural Information Processing Systems

Examples of such knowledge definitions include output logits (e.g., DistilBERT). We have defined the three types of KD, 1S-KD, 2S-KD, and 3S-KD, in the main text; here we explain them in more detail. One-Stage KD naively minimizes the sum of teacher-student differences on hidden states, attentions, and logits. Finally, Three-Stage KD inherits the properties of 1S-KD and 2S-KD. See Table C.1 for the size of each (augmented) dataset. Overall we find the standard deviations are within 0.1 on the GLUE score.
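As a rough illustration of the One-Stage objective described above, the combined teacher-student loss can be sketched as follows. This is a minimal sketch, assuming mean-squared error on all three signals; the function and dictionary-key names are illustrative, not the paper's actual formulation:

```python
def mse(a, b):
    # mean-squared error between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def one_stage_kd_loss(teacher, student):
    # 1S-KD (sketch): naively sum the teacher-student gaps on
    # hidden states, attentions, and output logits in one objective
    return (mse(teacher["hidden"], student["hidden"])
            + mse(teacher["attn"], student["attn"])
            + mse(teacher["logits"], student["logits"]))
```

A student matching the teacher on all three signals incurs zero loss; any mismatch on any signal increases it additively.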




CommVQ: Commutative Vector Quantization for KV Cache Compression

Li, Junyan, Zhang, Yang, Hassan, Muhammad Yusuf, Chafekar, Talha, Cai, Tianle, Ren, Zhile, Guo, Pengsheng, Karimzadeh, Foroozan, Reed, Colorado, Wang, Chong, Gan, Chuang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
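The core additive-quantization idea in the abstract — a vector is encoded as one code per codebook and decoded as a sum of the selected code words, which is a matrix multiplication against one-hot codes — can be sketched in a few lines. This is a toy greedy encoder, not CommVQ's actual method: the lightweight encoder, the EM-trained RoPE-commutative codebook, and all names here are omitted or assumed:

```python
def encode_additive(x, codebooks):
    # Greedy additive quantization: for each codebook in turn, pick the
    # code word closest to the current residual, then subtract it.
    residual = list(x)
    codes = []
    for cb in codebooks:
        best_i, best_err = 0, float("inf")
        for i, word in enumerate(cb):
            err = sum((r - w) ** 2 for r, w in zip(residual, word))
            if err < best_err:
                best_i, best_err = i, err
        codes.append(best_i)
        residual = [r - w for r, w in zip(residual, cb[best_i])]
    return codes

def decode_additive(codes, codebooks):
    # Decoding is just a sum of the selected code words — equivalently,
    # a matrix multiply between one-hot codes and the codebook matrix.
    out = [0.0] * len(codebooks[0][0])
    for code, cb in zip(codes, codebooks):
        out = [o + w for o, w in zip(out, cb[code])]
    return out
```

Storage drops from one float per dimension to one small code index per codebook, which is where the 2-bit and 1-bit KV-cache sizes come from.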


Frame Quantization of Neural Networks

Czaja, Wojciech, Na, Sanghoon

arXiv.org Machine Learning

Quantization is the process of compressing input from a continuous or large set of values into a small-sized discrete set. It gained popularity in signal processing, where one of its primary goals is obtaining a condensed representation of the analogue signal suitable for digital storage and recovery. Examples of quantization algorithms include truncated binary expansion, pulse-code modulation (PCM) and sigma-delta (ΣΔ) quantization. Among them, ΣΔ algorithms stand out due to their theoretically guaranteed robustness. Mathematical theories were developed in several seminal works [3-5, 8, 11], and have been carefully studied since, e.g., [14, 15, 19, 27]. In recent years, the concept of quantization also captured the attention of the machine learning community. The quantization of deep neural networks (DNNs) is considered one of the most effective network compression techniques [9]. Computers express parameters of a neural network as 32-bit or 64-bit floating point numbers.
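To make the ΣΔ idea concrete: a first-order sigma-delta quantizer emits one coarse (here 1-bit) value per sample while carrying the accumulated quantization error forward, so the error is shaped rather than lost. This is a textbook sketch of the classic first-order scheme, not the paper's construction:

```python
def sigma_delta_1bit(samples):
    # First-order sigma-delta: quantize each sample to +/-1 while the
    # state u tracks the running quantization error (noise shaping).
    u = 0.0
    bits = []
    for x in samples:
        q = 1.0 if u + x >= 0 else -1.0  # 1-bit quantizer decision
        u = u + x - q                    # carry the error to later samples
        bits.append(q)
    return bits
```

For a constant input of 0.5, the bitstream averages to 0.5 even though every individual output is ±1 — the error cancels over time, which is the robustness property the abstract alludes to.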